BIX Challenge

Data science applied to optimizing maintenance planning

Analysis by Marcella Café

In [1]:
# Installing the DMwR package, which is no longer available on CRAN. Installing it requires the devtools package.
# devtools::install_github("cran/DMwR")

options(scipen = 999) # Disable scientific notation

# Loading the required libraries
library(caret)
library(corrplot)
library(DMwR)
library(e1071)
library(factoextra)
library(FactoMineR)
library(mice)
library(naniar)
library(plotly)
library(qcc)
library(randomForest)
library(reactable)
library(readxl)
library(ROSE)
library(tidyverse)
library(VIM)
library(xgboost)

Loading the data

The first step is to load the training and test datasets. The training set contains 60,000 observations and 171 variables; 59,000 observations belong to the negative class and 1,000 to the positive class. The test set contains 16,000 observations and 171 variables.

In [3]:
# Loading the training (previous years) and test (2020) datasets
df_train <- read.csv("base_vigente_anos_anteriores.csv", header = T, na.strings = "na")
df_test <- read.csv("base_vigente_2020.csv", header = T, na.strings = "na")

# Checking the dimensions of the training dataset
dim(df_train)
  1. 60000
  2. 171
In [4]:
# Inspecting the first rows of the training dataset
head(df_train)
A data.frame: 6 × 171
[head() output omitted: six rows of the 171-column training set, from class and aa_000 through ef_000 and eg_000; the table is too wide to render legibly here]
In [5]:
# Checking the column names of the training dataset
dput(colnames(df_train))
c("class", "aa_000", "ab_000", "ac_000", "ad_000", "ae_000", 
"af_000", "ag_000", "ag_001", "ag_002", "ag_003", "ag_004", "ag_005", 
"ag_006", "ag_007", "ag_008", "ag_009", "ah_000", "ai_000", "aj_000", 
"ak_000", "al_000", "am_0", "an_000", "ao_000", "ap_000", "aq_000", 
"ar_000", "as_000", "at_000", "au_000", "av_000", "ax_000", "ay_000", 
"ay_001", "ay_002", "ay_003", "ay_004", "ay_005", "ay_006", "ay_007", 
"ay_008", "ay_009", "az_000", "az_001", "az_002", "az_003", "az_004", 
"az_005", "az_006", "az_007", "az_008", "az_009", "ba_000", "ba_001", 
"ba_002", "ba_003", "ba_004", "ba_005", "ba_006", "ba_007", "ba_008", 
"ba_009", "bb_000", "bc_000", "bd_000", "be_000", "bf_000", "bg_000", 
"bh_000", "bi_000", "bj_000", "bk_000", "bl_000", "bm_000", "bn_000", 
"bo_000", "bp_000", "bq_000", "br_000", "bs_000", "bt_000", "bu_000", 
"bv_000", "bx_000", "by_000", "bz_000", "ca_000", "cb_000", "cc_000", 
"cd_000", "ce_000", "cf_000", "cg_000", "ch_000", "ci_000", "cj_000", 
"ck_000", "cl_000", "cm_000", "cn_000", "cn_001", "cn_002", "cn_003", 
"cn_004", "cn_005", "cn_006", "cn_007", "cn_008", "cn_009", "co_000", 
"cp_000", "cq_000", "cr_000", "cs_000", "cs_001", "cs_002", "cs_003", 
"cs_004", "cs_005", "cs_006", "cs_007", "cs_008", "cs_009", "ct_000", 
"cu_000", "cv_000", "cx_000", "cy_000", "cz_000", "da_000", "db_000", 
"dc_000", "dd_000", "de_000", "df_000", "dg_000", "dh_000", "di_000", 
"dj_000", "dk_000", "dl_000", "dm_000", "dn_000", "do_000", "dp_000", 
"dq_000", "dr_000", "ds_000", "dt_000", "du_000", "dv_000", "dx_000", 
"dy_000", "dz_000", "ea_000", "eb_000", "ec_00", "ed_000", "ee_000", 
"ee_001", "ee_002", "ee_003", "ee_004", "ee_005", "ee_006", "ee_007", 
"ee_008", "ee_009", "ef_000", "eg_000")
In [6]:
# Summary of the training dataset
summary(df_train)
 class           aa_000            ab_000           ac_000          
 neg:59000   Min.   :      0   Min.   :  0.00   Min.   :         0  
 pos: 1000   1st Qu.:    834   1st Qu.:  0.00   1st Qu.:        16  
             Median :  30776   Median :  0.00   Median :       152  
             Mean   :  59337   Mean   :  0.71   Mean   : 356014263  
             3rd Qu.:  48668   3rd Qu.:  0.00   3rd Qu.:       964  
             Max.   :2746564   Max.   :204.00   Max.   :2130706796  
                               NA's   :46329    NA's   :3335        
     ad_000               ae_000              af_000             ag_000       
 Min.   :         0   Min.   :    0.000   Min.   :    0.00   Min.   :      0  
 1st Qu.:        24   1st Qu.:    0.000   1st Qu.:    0.00   1st Qu.:      0  
 Median :       126   Median :    0.000   Median :    0.00   Median :      0  
 Mean   :    190621   Mean   :    6.819   Mean   :   11.01   Mean   :    222  
 3rd Qu.:       430   3rd Qu.:    0.000   3rd Qu.:    0.00   3rd Qu.:      0  
 Max.   :8584297742   Max.   :21050.000   Max.   :20070.00   Max.   :3376892  
 NA's   :14861        NA's   :2500        NA's   :2500       NA's   :671      
     ag_001            ag_002             ag_003             ag_004         
 Min.   :      0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:      0   1st Qu.:       0   1st Qu.:       0   1st Qu.:      308  
 Median :      0   Median :       0   Median :       0   Median :     3672  
 Mean   :    976   Mean   :    8606   Mean   :   88591   Mean   :   437097  
 3rd Qu.:      0   3rd Qu.:       0   3rd Qu.:       0   3rd Qu.:    49522  
 Max.   :4109372   Max.   :10552856   Max.   :63402074   Max.   :228830570  
 NA's   :671       NA's   :671        NA's   :671        NA's   :671        
     ag_005              ag_006             ag_007             ag_008        
 Min.   :        0   Min.   :       0   Min.   :       0   Min.   :       0  
 1st Qu.:    13834   1st Qu.:   10608   1st Qu.:       0   1st Qu.:       0  
 Median :   176020   Median :  930336   Median :  119204   Median :    1786  
 Mean   :  1108374   Mean   : 1657818   Mean   :  499310   Mean   :   35570  
 3rd Qu.:   913964   3rd Qu.: 1886608   3rd Qu.:  588820   3rd Qu.:   26690  
 Max.   :179187978   Max.   :94020666   Max.   :63346754   Max.   :17702522  
 NA's   :671         NA's   :671        NA's   :671        NA's   :671       
     ag_009             ah_000             ai_000             aj_000       
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :      0  
 1st Qu.:       0   1st Qu.:   29733   1st Qu.:       0   1st Qu.:      0  
 Median :       0   Median : 1002420   Median :       0   Median :      0  
 Mean   :    5115   Mean   : 1809931   Mean   :    9017   Mean   :   1144  
 3rd Qu.:     364   3rd Qu.: 1601366   3rd Qu.:       0   3rd Qu.:      0  
 Max.   :25198514   Max.   :74247318   Max.   :16512852   Max.   :5629340  
 NA's   :671        NA's   :645        NA's   :629        NA's   :629      
     ak_000             al_000              am_0              an_000         
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:       0   1st Qu.:       0   1st Qu.:       0   1st Qu.:    73238  
 Median :       0   Median :       0   Median :       0   Median :  1918629  
 Mean   :     979   Mean   :   59130   Mean   :   93281   Mean   :  3461037  
 3rd Qu.:       0   3rd Qu.:    1204   3rd Qu.:    2364   3rd Qu.:  3128416  
 Max.   :10444924   Max.   :34762578   Max.   :55903508   Max.   :140861830  
 NA's   :4400       NA's   :642        NA's   :629        NA's   :642        
     ao_000              ap_000             aq_000             ar_000        
 Min.   :        0   Min.   :       0   Min.   :       0   Min.   :  0.0000  
 1st Qu.:    65585   1st Qu.:   25189   1st Qu.:    4161   1st Qu.:  0.0000  
 Median :  1643556   Median :  357281   Median :  178792   Median :  0.0000  
 Mean   :  3002440   Mean   : 1004160   Mean   :  442404   Mean   :  0.4969  
 3rd Qu.:  2675796   3rd Qu.:  724660   3rd Qu.:  376900   3rd Qu.:  0.0000  
 Max.   :122201822   Max.   :77934944   Max.   :25562646   Max.   :350.0000  
 NA's   :589         NA's   :642        NA's   :589        NA's   :2723      
     as_000              at_000             au_000              av_000      
 Min.   :      0.0   Min.   :       0   Min.   :      0.0   Min.   :     0  
 1st Qu.:      0.0   1st Qu.:       0   1st Qu.:      0.0   1st Qu.:    12  
 Median :      0.0   Median :       0   Median :      0.0   Median :   116  
 Mean   :    126.7   Mean   :    5072   Mean   :    230.6   Mean   :  1118  
 3rd Qu.:      0.0   3rd Qu.:       0   3rd Qu.:      0.0   3rd Qu.:   646  
 Max.   :1655240.0   Max.   :10400504   Max.   :2626676.0   Max.   :794458  
 NA's   :629         NA's   :629        NA's   :629         NA's   :2500    
     ax_000             ay_000             ay_001             ay_002        
 Min.   :     0.0   Min.   :       0   Min.   :       0   Min.   :       0  
 1st Qu.:    10.0   1st Qu.:       0   1st Qu.:       0   1st Qu.:       0  
 Median :    66.0   Median :       0   Median :       0   Median :       0  
 Mean   :   374.3   Mean   :   12212   Mean   :   10190   Mean   :   10975  
 3rd Qu.:   263.0   3rd Qu.:       0   3rd Qu.:       0   3rd Qu.:       0  
 Max.   :116652.0   Max.   :50553892   Max.   :80525378   Max.   :28474838  
 NA's   :2501       NA's   :671        NA's   :671        NA's   :671       
     ay_003             ay_004             ay_005              ay_006         
 Min.   :       0   Min.   :       0   Min.   :        0   Min.   :        0  
 1st Qu.:       0   1st Qu.:       0   1st Qu.:        0   1st Qu.:        0  
 Median :       0   Median :       0   Median :        0   Median :   168202  
 Mean   :    7226   Mean   :   10566   Mean   :   111979   Mean   :  1078551  
 3rd Qu.:       0   3rd Qu.:       0   3rd Qu.:    40132   3rd Qu.:  1270244  
 Max.   :13945170   Max.   :40028704   Max.   :124948914   Max.   :127680326  
 NA's   :671        NA's   :671        NA's   :671         NA's   :671        
     ay_007              ay_008              ay_009             az_000        
 Min.   :        0   Min.   :        0   Min.   :       0   Min.   :       0  
 1st Qu.:     6118   1st Qu.:     7524   1st Qu.:       0   1st Qu.:    1028  
 Median :   348622   Median :    94812   Median :       0   Median :    2098  
 Mean   :  1546032   Mean   :  1051123   Mean   :    1163   Mean   :    7850  
 3rd Qu.:  1337364   3rd Qu.:   611816   3rd Qu.:       0   3rd Qu.:    4150  
 Max.   :489678156   Max.   :104566992   Max.   :18824656   Max.   :10124620  
 NA's   :671         NA's   :671         NA's   :671        NA's   :671       
     az_001            az_002             az_003             az_004         
 Min.   :      0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:     60   1st Qu.:      90   1st Qu.:     294   1st Qu.:     1542  
 Median :    636   Median :    1016   Median :    3570   Median :    81614  
 Mean   :   4421   Mean   :    8066   Mean   :   87241   Mean   :  1476897  
 3rd Qu.:   2016   3rd Qu.:    3136   3rd Qu.:   43014   3rd Qu.:  1774410  
 Max.   :4530258   Max.   :14217662   Max.   :45584242   Max.   :123047106  
 NA's   :671       NA's   :671        NA's   :671        NA's   :671        
     az_005              az_006             az_007             az_008         
 Min.   :        0   Min.   :       0   Min.   :       0   Min.   :      0.0  
 1st Qu.:    38638   1st Qu.:      10   1st Qu.:       0   1st Qu.:      0.0  
 Median :   527034   Median :     292   Median :       0   Median :      0.0  
 Mean   :  2135584   Mean   :  101894   Mean   :   17378   Mean   :    661.8  
 3rd Qu.:  1796172   3rd Qu.:    4218   3rd Qu.:       0   3rd Qu.:      0.0  
 Max.   :467832334   Max.   :64589140   Max.   :39158218   Max.   :1947884.0  
 NA's   :671         NA's   :671        NA's   :671        NA's   :671        
     az_009             ba_000              ba_001              ba_002        
 Min.   :     0.0   Min.   :        0   Min.   :        0   Min.   :       0  
 1st Qu.:     0.0   1st Qu.:    33478   1st Qu.:    14834   1st Qu.:    5208  
 Median :     0.0   Median :   679486   Median :   443641   Median :  186046  
 Mean   :    42.1   Mean   :  1399652   Mean   :   894117   Mean   :  413097  
 3rd Qu.:     0.0   3rd Qu.:  1276368   3rd Qu.:   811000   3rd Qu.:  340408  
 Max.   :666148.0   Max.   :232871714   Max.   :116283282   Max.   :55807388  
 NA's   :671        NA's   :688         NA's   :688         NA's   :688       
     ba_003             ba_004             ba_005             ba_006        
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
 1st Qu.:    1842   1st Qu.:     612   1st Qu.:     374   1st Qu.:     352  
 Median :  134149   Median :  101874   Median :   83992   Median :   70116  
 Mean   :  274007   Mean   :  204876   Mean   :  188941   Mean   :  210629  
 3rd Qu.:  244449   3rd Qu.:  197162   3rd Qu.:  184674   3rd Qu.:  204959  
 Max.   :36931418   Max.   :25158556   Max.   :19208664   Max.   :18997660  
 NA's   :688        NA's   :688        NA's   :688        NA's   :688       
     ba_007             ba_008             ba_009             bb_000         
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:      72   1st Qu.:       0   1st Qu.:       0   1st Qu.:   105601  
 Median :    4470   Median :      22   Median :       0   Median :  2360728  
 Mean   :  185787   Mean   :   35883   Mean   :   35767   Mean   :  4526177  
 3rd Qu.:  207471   3rd Qu.:    1820   3rd Qu.:      60   3rd Qu.:  3868370  
 Max.   :14314086   Max.   :31265984   Max.   :43706408   Max.   :192871534  
 NA's   :688        NA's   :688        NA's   :688        NA's   :645        
     bc_000             bd_000             be_000           bf_000        
 Min.   :     0.0   Min.   :     0.0   Min.   :     0   Min.   :    0.00  
 1st Qu.:     0.0   1st Qu.:     8.0   1st Qu.:    18   1st Qu.:    0.00  
 Median :    16.0   Median :    66.0   Median :   180   Median :    2.00  
 Mean   :   569.5   Mean   :   921.8   Mean   :  1373   Mean   :   74.88  
 3rd Qu.:   136.0   3rd Qu.:   438.0   3rd Qu.:   614   3rd Qu.:   18.00  
 Max.   :396952.0   Max.   :306452.0   Max.   :810568   Max.   :51050.00  
 NA's   :2725       NA's   :2727       NA's   :2503     NA's   :2500      
     bg_000             bh_000            bi_000             bj_000        
 Min.   :       0   Min.   :      0   Min.   :       0   Min.   :       0  
 1st Qu.:   29743   1st Qu.:    852   1st Qu.:   15947   1st Qu.:    8522  
 Median : 1002718   Median :  26352   Median :  179842   Median :  154404  
 Mean   : 1809431   Mean   :  57943   Mean   :  492208   Mean   :  510089  
 3rd Qu.: 1602766   3rd Qu.:  49086   3rd Qu.:  379610   3rd Qu.:  333608  
 Max.   :74247318   Max.   :3200582   Max.   :44937496   Max.   :45736316  
 NA's   :642        NA's   :642       NA's   :589        NA's   :589       
     bk_000            bl_000            bm_000            bn_000       
 Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
 1st Qu.: 162720   1st Qu.: 170540   1st Qu.: 172210   1st Qu.: 171720  
 Median : 210660   Median : 222540   Median : 239140   Median : 251400  
 Mean   : 280429   Mean   : 321354   Mean   : 399603   Mean   : 463711  
 3rd Qu.: 281115   3rd Qu.: 303150   3rd Qu.: 369100   3rd Qu.: 493100  
 Max.   :1310700   Max.   :1310700   Max.   :1310700   Max.   :1310700  
 NA's   :23034     NA's   :27277     NA's   :39549     NA's   :44009    
     bo_000            bp_000            bq_000            br_000       
 Min.   :      0   Min.   :      0   Min.   :      0   Min.   :      0  
 1st Qu.: 170550   1st Qu.: 172170   1st Qu.: 170420   1st Qu.: 169470  
 Median : 270660   Median : 288320   Median : 305100   Median : 320400  
 Mean   : 513148   Mean   : 551390   Mean   : 582871   Mean   : 604887  
 3rd Qu.:1310700   3rd Qu.:1310700   3rd Qu.:1310700   3rd Qu.:1310700  
 Max.   :1310700   Max.   :1310700   Max.   :1310700   Max.   :1310700  
 NA's   :46333     NA's   :47740     NA's   :48722     NA's   :49264    
     bs_000            bt_000              bu_000              bv_000         
 Min.   :      0   Min.   :      0.0   Min.   :        0   Min.   :        0  
 1st Qu.:  17300   1st Qu.:    862.8   1st Qu.:   105444   1st Qu.:   105444  
 Median :  50540   Median :  30839.9   Median :  2359656   Median :  2359656  
 Mean   :  80361   Mean   :  59416.5   Mean   :  4515325   Mean   :  4515325  
 3rd Qu.: 118635   3rd Qu.:  48787.9   3rd Qu.:  3863322   3rd Qu.:  3863322  
 Max.   :1037240   Max.   :2746564.8   Max.   :192871534   Max.   :192871534  
 NA's   :726       NA's   :167         NA's   :691         NA's   :691        
     bx_000              by_000            bz_000             ca_000      
 Min.   :      172   Min.   :      0   Min.   :       0   Min.   :     0  
 1st Qu.:    89649   1st Qu.:    216   1st Qu.:       6   1st Qu.:  6886  
 Median :  2258824   Median :  12628   Median :    1036   Median : 25436  
 Mean   :  4112218   Mean   :  22029   Mean   :  101961   Mean   : 39169  
 3rd Qu.:  3645960   3rd Qu.:  20348   3rd Qu.:   13674   3rd Qu.: 68005  
 Max.   :186353854   Max.   :1002003   Max.   :40542588   Max.   :120956  
 NA's   :3257        NA's   :473       NA's   :2723       NA's   :4356    
     cb_000            cc_000              cd_000            ce_000       
 Min.   :      0   Min.   :        0   Min.   :1209600   Min.   :      0  
 1st Qu.:  77125   1st Qu.:    62416   1st Qu.:1209600   1st Qu.:    266  
 Median : 278990   Median :  2108912   Median :1209600   Median :   3409  
 Mean   : 405638   Mean   :  3803444   Mean   :1209600   Mean   :  64344  
 3rd Qu.: 704580   3rd Qu.:  3364634   3rd Qu.:1209600   3rd Qu.:  87236  
 Max.   :1209520   Max.   :148615188   Max.   :1209600   Max.   :4908098  
 NA's   :726       NA's   :3255        NA's   :676       NA's   :2502     
     cf_000               cg_000             ch_000          ci_000         
 Min.   :         0   Min.   :    0.00   Min.   :0       Min.   :        0  
 1st Qu.:         0   1st Qu.:    8.00   1st Qu.:0       1st Qu.:    48250  
 Median :         2   Median :   46.00   Median :0       Median :  1858641  
 Mean   :    190222   Mean   :   91.52   Mean   :0       Mean   :  3481204  
 3rd Qu.:         2   3rd Qu.:  104.00   3rd Qu.:0       3rd Qu.:  2947266  
 Max.   :8584297736   Max.   :21400.00   Max.   :2       Max.   :140986130  
 NA's   :14861        NA's   :14861      NA's   :14861   NA's   :338        
     cj_000             ck_000             cl_000           cm_000       
 Min.   :       0   Min.   :       0   Min.   :     0   Min.   :    0.0  
 1st Qu.:       0   1st Qu.:   14587   1st Qu.:     0   1st Qu.:    0.0  
 Median :       0   Median :  250267   Median :     0   Median :    8.0  
 Mean   :  102842   Mean   :  714343   Mean   :   343   Mean   :  343.1  
 3rd Qu.:       0   3rd Qu.:  549352   3rd Qu.:     2   3rd Qu.:  100.0  
 Max.   :60949671   Max.   :55428669   Max.   :130560   Max.   :73370.0  
 NA's   :338        NA's   :338        NA's   :9553     NA's   :9877     
     cn_000            cn_001             cn_002             cn_003        
 Min.   :      0   Min.   :       0   Min.   :       0   Min.   :       0  
 1st Qu.:      0   1st Qu.:       0   1st Qu.:       0   1st Qu.:    4622  
 Median :      0   Median :       0   Median :       0   Median :   34994  
 Mean   :   2337   Mean   :   21951   Mean   :  161051   Mean   :  531478  
 3rd Qu.:      0   3rd Qu.:       0   3rd Qu.:    7936   3rd Qu.:  232066  
 Max.   :6278490   Max.   :14512994   Max.   :58508606   Max.   :94979324  
 NA's   :687       NA's   :687        NA's   :687        NA's   :687       
     cn_004              cn_005              cn_006             cn_007        
 Min.   :        0   Min.   :        0   Min.   :       0   Min.   :       0  
 1st Qu.:    19114   1st Qu.:     5028   1st Qu.:     626   1st Qu.:      62  
 Median :   518462   Median :   703524   Median :   96266   Median :    9976  
 Mean   :  1282835   Mean   :  1341059   Mean   :  410564   Mean   :   64425  
 3rd Qu.:  1207694   3rd Qu.:  1519808   3rd Qu.:  449060   3rd Qu.:   31074  
 Max.   :169869316   Max.   :117815764   Max.   :72080406   Max.   :33143734  
 NA's   :687         NA's   :687         NA's   :687        NA's   :687       
     cn_008            cn_009             co_000               cp_000        
 Min.   :      0   Min.   :       0   Min.   :         0   Min.   :     0.0  
 1st Qu.:      0   1st Qu.:       0   1st Qu.:         0   1st Qu.:     4.0  
 Median :   1852   Median :      24   Median :         8   Median :    14.0  
 Mean   :  19227   Mean   :    7820   Mean   :    190516   Mean   :   570.4  
 3rd Qu.:   5286   3rd Qu.:     294   3rd Qu.:        72   3rd Qu.:    82.0  
 Max.   :7541716   Max.   :36398374   Max.   :8584297742   Max.   :496360.0  
 NA's   :687       NA's   :687        NA's   :14861        NA's   :2724      
     cq_000              cr_000             cs_000           cs_001        
 Min.   :        0   Min.   :    0.00   Min.   :     0   Min.   :     0.0  
 1st Qu.:   105444   1st Qu.:    0.00   1st Qu.:  1232   1st Qu.:    32.0  
 Median :  2359656   Median :    0.00   Median :  3192   Median :   360.0  
 Mean   :  4515325   Mean   :   37.06   Mean   :  5480   Mean   :   788.4  
 3rd Qu.:  3863322   3rd Qu.:    0.00   3rd Qu.:  5686   3rd Qu.:   692.0  
 Max.   :192871534   Max.   :57450.00   Max.   :839240   Max.   :438806.0  
 NA's   :691         NA's   :46329      NA's   :669      NA's   :669       
     cs_002             cs_003             cs_004             cs_005         
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:     222   1st Qu.:    3079   1st Qu.:    2726   1st Qu.:    19186  
 Median :   20570   Median :  121780   Median :   91080   Median :  1220860  
 Mean   :  238811   Mean   :  355373   Mean   :  444228   Mean   :  2235387  
 3rd Qu.:   94924   3rd Qu.:  295909   3rd Qu.:  208500   3rd Qu.:  2049318  
 Max.   :46085940   Max.   :42421854   Max.   :74860628   Max.   :379142116  
 NA's   :669        NA's   :669        NA's   :669        NA's   :669        
     cs_006             cs_007             cs_008              cs_009        
 Min.   :       0   Min.   :       0   Min.   :      0.0   Min.   :       0  
 1st Qu.:   13360   1st Qu.:    1204   1st Qu.:      2.0   1st Qu.:       0  
 Median :  240744   Median :    6104   Median :     46.0   Median :       0  
 Mean   :  545774   Mean   :   14771   Mean   :    211.7   Mean   :     779  
 3rd Qu.:  686260   3rd Qu.:   18164   3rd Qu.:    148.0   3rd Qu.:       0  
 Max.   :73741974   Max.   :12884218   Max.   :1584558.0   Max.   :44902992  
 NA's   :669        NA's   :669        NA's   :669         NA's   :669       
     ct_000             cu_000           cv_000             cx_000        
 Min.   :     0.0   Min.   :     0   Min.   :       0   Min.   :       0  
 1st Qu.:    40.0   1st Qu.:    82   1st Qu.:   23898   1st Qu.:     944  
 Median :   210.0   Median :   278   Median : 1181117   Median :   44465  
 Mean   :   749.1   Mean   :  1223   Mean   : 1928825   Mean   :  351510  
 3rd Qu.:   672.0   3rd Qu.:   856   3rd Qu.: 2400717   3rd Qu.:  126794  
 Max.   :910366.0   Max.   :733688   Max.   :81610510   Max.   :44105494  
 NA's   :13808      NA's   :13808    NA's   :13808      NA's   :13808     
     cy_000             cz_000             da_000              db_000       
 Min.   :     0.0   Min.   :       0   Min.   :    0.000   Min.   :   0.00  
 1st Qu.:     0.0   1st Qu.:       4   1st Qu.:    0.000   1st Qu.:   0.00  
 Median :     0.0   Median :     202   Median :    0.000   Median :   0.00  
 Mean   :   274.2   Mean   :   19374   Mean   :    7.394   Mean   :  13.42  
 3rd Qu.:     0.0   3rd Qu.:    6343   3rd Qu.:    0.000   3rd Qu.:  18.00  
 Max.   :931472.0   Max.   :19156530   Max.   :21006.000   Max.   :9636.00  
 NA's   :13808      NA's   :13808      NA's   :13808       NA's   :13808    
     dc_000              dd_000           de_000             df_000        
 Min.   :        0   Min.   :     0   Min.   :     0.0   Min.   :       0  
 1st Qu.:    26558   1st Qu.:   132   1st Qu.:    66.0   1st Qu.:       0  
 Median :  1734472   Median :  1354   Median :   144.0   Median :       0  
 Mean   :  2200752   Mean   :  3124   Mean   :   375.1   Mean   :    2719  
 3rd Qu.:  2644938   3rd Qu.:  2678   3rd Qu.:   296.0   3rd Qu.:       0  
 Max.   :120759484   Max.   :445142   Max.   :176176.0   Max.   :21613910  
 NA's   :13808       NA's   :2503     NA's   :2724       NA's   :4008      
     dg_000             dh_000              di_000             dj_000        
 Min.   :       0   Min.   :        0   Min.   :       0   Min.   :     0.0  
 1st Qu.:       0   1st Qu.:        0   1st Qu.:       0   1st Qu.:     0.0  
 Median :       0   Median :        0   Median :       0   Median :     0.0  
 Mean   :    5610   Mean   :     4707   Mean   :   37248   Mean   :    39.9  
 3rd Qu.:       0   3rd Qu.:        0   3rd Qu.:       0   3rd Qu.:     0.0  
 Max.   :27064294   Max.   :124700880   Max.   :22987424   Max.   :726750.0  
 NA's   :4008       NA's   :4008        NA's   :4006       NA's   :4007      
     dk_000            dl_000              dm_000             dn_000       
 Min.   :      0   Min.   :        0   Min.   :       0   Min.   :      0  
 1st Qu.:      0   1st Qu.:        0   1st Qu.:       0   1st Qu.:    660  
 Median :      0   Median :        0   Median :       0   Median :  14330  
 Mean   :   1861   Mean   :    28542   Mean   :    7923   Mean   :  33746  
 3rd Qu.:      0   3rd Qu.:        0   3rd Qu.:       0   3rd Qu.:  27340  
 Max.   :5483574   Max.   :103858120   Max.   :23697916   Max.   :2924584  
 NA's   :4007      NA's   :4008        NA's   :4009       NA's   :691      
     do_000            dp_000           dq_000               dr_000        
 Min.   :      0   Min.   :     0   Min.   :         0   Min.   :       0  
 1st Qu.:     20   1st Qu.:     6   1st Qu.:         0   1st Qu.:       0  
 Median :  10377   Median :  2532   Median :         0   Median :       0  
 Mean   :  28508   Mean   :  6959   Mean   :   4529375   Mean   :  203760  
 3rd Qu.:  37672   3rd Qu.:  8320   3rd Qu.:         0   3rd Qu.:       0  
 Max.   :1874542   Max.   :348118   Max.   :6351872864   Max.   :50137662  
 NA's   :2724      NA's   :2726     NA's   :2726         NA's   :2726      
     ds_000            dt_000           du_000              dv_000         
 Min.   :      0   Min.   :     0   Min.   :        0   Min.   :        0  
 1st Qu.:    684   1st Qu.:   150   1st Qu.:     5380   1st Qu.:      742  
 Median :  47940   Median :  8316   Median :   185400   Median :    30592  
 Mean   :  89655   Mean   : 15403   Mean   :  4058712   Mean   :   593835  
 3rd Qu.:  99202   3rd Qu.: 17630   3rd Qu.:  3472540   3rd Qu.:   534656  
 Max.   :4970962   Max.   :656432   Max.   :460207620   Max.   :127034534  
 NA's   :2727      NA's   :2727     NA's   :2726        NA's   :2726       
     dx_000              dy_000            dz_000              ea_000        
 Min.   :        0   Min.   :      0   Min.   :   0.0000   Min.   :   0.000  
 1st Qu.:        0   1st Qu.:      0   1st Qu.:   0.0000   1st Qu.:   0.000  
 Median :        0   Median :      0   Median :   0.0000   Median :   0.000  
 Mean   :   791208   Mean   :   7780   Mean   :   0.2158   Mean   :   1.568  
 3rd Qu.:     8764   3rd Qu.:     36   3rd Qu.:   0.0000   3rd Qu.:   0.000  
 Max.   :114288420   Max.   :3793022   Max.   :1414.0000   Max.   :8506.000  
 NA's   :2723        NA's   :2724      NA's   :2723        NA's   :2723      
     eb_000               ec_00              ed_000          ee_000        
 Min.   :         0   Min.   :     0.0   Min.   :    0   Min.   :       0  
 1st Qu.:         0   1st Qu.:   114.8   1st Qu.:   98   1st Qu.:   15698  
 Median :    622110   Median :   754.4   Median :  832   Median :  260704  
 Mean   :   9717093   Mean   :  1353.1   Mean   : 1452   Mean   :  733404  
 3rd Qu.:   4000270   3rd Qu.:  1379.4   3rd Qu.: 1504   3rd Qu.:  573060  
 Max.   :1322456920   Max.   :106020.2   Max.   :82806   Max.   :74984446  
 NA's   :4007         NA's   :10239      NA's   :9553    NA's   :671       
     ee_001             ee_002             ee_003             ee_004        
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
 1st Qu.:    8536   1st Qu.:    2936   1st Qu.:    1166   1st Qu.:    2700  
 Median :  346940   Median :  233796   Median :  112086   Median :  221518  
 Mean   :  783875   Mean   :  445490   Mean   :  211126   Mean   :  445734  
 3rd Qu.:  667390   3rd Qu.:  438396   3rd Qu.:  218232   3rd Qu.:  466614  
 Max.   :98224378   Max.   :77933926   Max.   :37758390   Max.   :97152378  
 NA's   :671        NA's   :671        NA's   :671        NA's   :671       
     ee_005             ee_006             ee_007              ee_008        
 Min.   :       0   Min.   :       0   Min.   :        0   Min.   :       0  
 1st Qu.:    3584   1st Qu.:     512   1st Qu.:      110   1st Qu.:       0  
 Median :  189988   Median :   92432   Median :    41098   Median :    3812  
 Mean   :  393946   Mean   :  333058   Mean   :   346271   Mean   :  138730  
 3rd Qu.:  403222   3rd Qu.:  275094   3rd Qu.:   167814   3rd Qu.:  139724  
 Max.   :57435236   Max.   :31607814   Max.   :119580108   Max.   :19267396  
 NA's   :671        NA's   :671        NA's   :671         NA's   :671       
     ee_009            ef_000             eg_000         
 Min.   :      0   Min.   :  0.0000   Min.   :   0.0000  
 1st Qu.:      0   1st Qu.:  0.0000   1st Qu.:   0.0000  
 Median :      0   Median :  0.0000   Median :   0.0000  
 Mean   :   8389   Mean   :  0.0906   Mean   :   0.2128  
 3rd Qu.:   2028   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
 Max.   :3810078   Max.   :482.0000   Max.   :1146.0000  
 NA's   :671       NA's   :2724       NA's   :2723       
In [7]:
# Number of observations in the "pos" class
qtd_pos <- df_train %>%
  dplyr::filter(class == "pos")

dim(qtd_pos)
# Number of observations in the "neg" class
qtd_neg <- df_train %>%
  dplyr::filter(class == "neg")

dim(qtd_neg)
  1. 1000
  2. 171
  1. 59000
  2. 171
In [8]:
# Dimensions of the test dataset
dim(df_test)
  1. 16000
  2. 171
In [64]:
# Class distribution in the training dataset
plotly::plot_ly(data = df_train, x = ~class, type = "histogram")
In [65]:
# Class distribution in the test dataset
plotly::plot_ly(data = df_test, x = ~class, type = "histogram")

Exploratory analysis

Checking for missing data

A large share of the data is missing. In extreme cases, as the analysis below shows, some variables have more than 80% of their values missing. Since the dataset is both highly imbalanced and full of missing values, I will apply a few treatments to the data in order to obtain better results.

In [9]:
# Checking which variables have the most missing values
naniar::gg_miss_var(df_train)

var_na <- stack(100 * colSums(is.na(df_train)) / nrow(df_train))
order_var_na <- arrange(var_na, desc(values))

order_var_na$ind <- factor(order_var_na$ind, levels = unique(order_var_na$ind)[order(order_var_na$values, decreasing = TRUE)])

# Bar chart of the 15 variables with the most NA values
plotly::plot_ly(order_var_na[c(1:15), ], x = ~ind, y = ~values, type = "bar") %>%
  layout(title = "Top 15 NA values")
Warning message:
"`arrange_()` was deprecated in dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated."

To handle the missing data, the following measures will be taken:

  1. Drop columns with more than 70% of their values missing.
  2. Drop columns whose values are constant according to the descriptive analysis.
  3. Impute the remaining missing values in each column with its median.
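Step 3 (median imputation) can be sketched on a toy data frame before applying it to the real data below; `toy` here is made up purely for illustration:

```r
# Median imputation on a toy data frame: each NA is replaced by the
# median of the non-missing values in its column
toy <- data.frame(a = c(1, NA, 3, 5), b = c(10, 20, NA, NA))

for (i in seq_len(ncol(toy))) {
  toy[is.na(toy[, i]), i] <- median(toy[, i], na.rm = TRUE)
}

toy$a  # 1 3 3 5     (NA replaced by median(1, 3, 5) = 3)
toy$b  # 10 20 15 15 (NAs replaced by median(10, 20) = 15)
```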
In [10]:
# Removing columns with more than 70% missing data
dt_train <- df_train[, -which(colMeans(is.na(df_train)) > 0.7)]
dt_test <- df_test[, -which(colMeans(is.na(df_test)) > 0.7)]

# The "cd_000" column is constant, so it is dropped from the dataset
dt_train <- subset(dt_train, select = -c(cd_000))
dt_test <- subset(dt_test, select = -c(cd_000))

# The remaining missing values are imputed with the column median
train_imp <- dt_train[, -c(1, 2)]

for (i in 1:ncol(train_imp)) {
  train_imp[is.na(train_imp[, i]), i] <- median(train_imp[, i], na.rm = TRUE)
}

df_train_imp <- cbind(dt_train[, c(1, 2)], train_imp)

# The same is done for the test dataset

test_imp <- dt_test[, -c(1, 2)]

for (i in 1:ncol(test_imp)) {
  test_imp[is.na(test_imp[, i]), i] <- median(test_imp[, i], na.rm = TRUE)
}

df_test_imp <- cbind(dt_test[, c(1, 2)], test_imp)

# Checking that no missing values remain after median imputation
naniar::gg_miss_var(df_train_imp)
naniar::gg_miss_var(df_test_imp)

Checking the correlation between the variables

In [11]:
cor_matrix <- cor(x = df_train_imp[, -1], method = "spearman", use = "complete.obs")

corrplot::corrplot(cor_matrix, type = "lower", method = "ellipse", title = "Correlation between the variables")

The plot shows that several pairs of variables are correlated (the blue marks). Let us identify which variables have the strongest relationships, i.e., a pairwise correlation above 0.75.

In [15]:
# Finding the columns that should be removed to reduce pairwise correlation
high_cor <- caret::findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
high_cor <- dput(high_cor)

# Number of variables
length(high_cor)
c("af_000", "ag_006", "ah_000", "an_000", "ao_000", "ap_000", 
"aq_000", "av_000", "ay_002", "ay_003", "ay_004", "ay_006", "ay_007", 
"az_001", "az_002", "az_004", "az_005", "ba_000", "ba_001", "ba_002", 
"ba_003", "ba_004", "ba_005", "ba_006", "bb_000", "bg_000", "bh_000", 
"bi_000", "bj_000", "bt_000", "bu_000", "bv_000", "bx_000", "by_000", 
"ca_000", "cb_000", "cc_000", "ce_000", "cg_000", "ci_000", "ck_000", 
"cn_002", "cn_003", "cn_004", "cn_005", "cn_006", "cn_007", "cn_008", 
"cq_000", "cs_000", "cs_001", "cs_002", "cs_003", "cs_004", "cs_005", 
"cs_006", "ct_000", "cv_000", "dc_000", "dd_000", "dg_000", "di_000", 
"dk_000", "dm_000", "dn_000", "do_000", "dp_000", "dr_000", "ds_000", 
"dt_000", "dv_000", "dy_000", "ed_000", "ee_000", "ee_001", "ee_002", 
"ee_003", "ee_004", "ee_005", "ee_006", "ee_008", "aa_000", "ag_007", 
"ag_008", "al_000", "ag_005", "ag_002", "ag_003", "ag_004", "az_003", 
"ad_000", "cj_000")
92

The analysis finds 92 columns that should be removed to reduce pairwise correlation, so they are dropped from both the training and test datasets.

In [16]:
# Dropping the highly correlated variables
df_train_X <- df_train_imp %>%
  dplyr::select(-dplyr::all_of(high_cor))

df_test_X <- df_test_imp %>%
  dplyr::select(-dplyr::all_of(high_cor))

PCA - Principal Component Analysis

Since the dataset has many columns and many of them are correlated, the most suitable feature engineering in this case is Principal Component Analysis (PCA).

The number of components to keep is chosen from the eigenvalues greater than 1, which indicate that a component accounts for more variance than a single original variable would. On that basis, 26 PCs capturing 64% of the variance will be used in the subsequent analyses.

In [22]:
pca <- FactoMineR::PCA(df_train_X[, -1], graph = TRUE)
autovalores <- factoextra::get_eigenvalue(pca)
factoextra::fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50))
data.table(autovalores)
A data.table: 70 × 3
 eigenvalue          variance.percent    cumulative.variance.percent
 <dbl>               <dbl>               <dbl>
 7.0535186           10.076455            10.07646
 2.7420725            3.917246            13.99370
 2.5857596            3.693942            17.68764
 2.4955765            3.565109            21.25275
 2.0808642            2.972663            24.22542
 2.0000733            2.857248            27.08266
 1.8066950            2.580993            29.66366
 1.8007853            2.572550            32.23621
 1.6884711            2.412102            34.64831
 1.6381976            2.340282            36.98859
 1.6050261            2.292894            39.28149
 1.5197969            2.171138            41.45262
 1.4156753            2.022393            43.47502
 1.3389278            1.912754            45.38777
 1.2436394            1.776628            47.16440
 1.2330741            1.761534            48.92593
 1.1733994            1.676285            50.60222
 1.1492136            1.641734            52.24395
 1.1326975            1.618139            53.86209
 1.1236653            1.605236            55.46733
 1.1144195            1.592028            57.05936
 1.0567038            1.509577            58.56893
 1.0549662            1.507095            60.07603
 1.0104705            1.443529            61.51956
 1.0036186            1.433741            62.95330
 1.0006788            1.429541            64.38284
 0.9997802            1.428257            65.81110
 0.9946709            1.420958            67.23205
 0.9850526            1.407218            68.63927
 0.9749125            1.392732            70.03200
 ...                  ...                  ...
 0.724769432503228    1.035384903575887   83.20437
 0.715591932174008    1.022274188819860   84.22665
 0.687340197384322    0.981914567691742   85.20856
 0.674319434520211    0.963313477885872   86.17188
 0.666878314815129    0.952683306878614   87.12456
 0.644734655462815    0.921049507803885   88.04561
 0.630004389981123    0.900006271401471   88.94562
 0.605281828513433    0.864688326447633   89.81030
 0.594510575416918    0.849300822024042   90.65960
 0.578406124832360    0.826294464046106   91.48590
 0.516018754259510    0.737169648942048   92.22307
 0.501605805345084    0.716579721921442   92.93965
 0.461301651266720    0.659002358952359   93.59865
 0.410487646811192    0.586410924015902   94.18506
 0.391846930931087    0.559781329901469   94.74484
 0.378395701148762    0.540565287355294   95.28541
 0.358440912889177    0.512058446984462   95.79747
 0.351249818088469    0.501785454412024   96.29925
 0.344913446111201    0.492733494444500   96.79199
 0.333064248345283    0.475806069064619   97.26779
 0.302951062645550    0.432787232350722   97.70058
 0.287828003530927    0.411182862186978   98.11176
 0.268730073759048    0.383900105370012   98.49566
 0.254254418244235    0.363220597491710   98.85888
 0.239824626856461    0.342606609794894   99.20149
 0.231147918155700    0.330211311650951   99.53170
 0.180247130990817    0.257495901415415   99.78920
 0.079493246375546    0.113561780536478   99.90276
 0.068069487942233    0.097242125631746  100.00000
 0.000000002357651    0.000000003368073  100.00000
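The 26-component choice follows the Kaiser criterion (keep components whose eigenvalue exceeds 1). A minimal sketch of the rule on made-up eigenvalues (in the notebook itself, `autovalores[, "eigenvalue"]` from the cell above would be used instead):

```r
# Kaiser rule: retain components whose eigenvalue exceeds 1
eig <- c(7.05, 2.74, 1.02, 0.99, 0.50)  # illustrative eigenvalues
n_pc <- sum(eig > 1)
n_pc  # 3 components retained in this toy example
```

Applied to the table above, `sum(autovalores[, "eigenvalue"] > 1)` yields the 26 PCs carried forward.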

Defining the predictor variables from the PCs

In [25]:
# PCA on the training dataset
pri <- stats::prcomp(df_train_X[, -1], center = TRUE, scale = TRUE)
df_pca <- as.data.frame(pri$x)
df_pca <- cbind(df_pca, df_train_imp[1])
df_pca_best <- df_pca[, c(1:26, 71)]
head(df_pca_best)
A data.frame: 6 × 27 — the first six training rows projected onto PC1–PC26, plus the class column (all "neg" in this preview; the wide numeric output is not reproduced here)
In [26]:
# PCA on the test dataset
df_pca_test_best <- predict(pri, df_test_imp[-1])
df_pca_test <- cbind(df_pca_test_best, df_test_imp[1]) %>% as.data.frame()
df_pca_test <- df_pca_test[, c(1:26, 71)]
head(df_pca_test)
A data.frame: 6 × 27 — the first six test rows projected onto PC1–PC26, plus the class column (all "neg" in this preview; the wide numeric output is not reproduced here)

Handling the class imbalance

To balance the data, 4 methods will be tried (undersampling, oversampling, SMOTE, and combined under- and oversampling). Each one will be tested with a logistic regression model, and the method with the best result will be chosen.

  1. Undersampling: some observations of the majority class are removed to balance the dataset.
  2. Oversampling: randomly duplicates minority-class examples and adds them to the training set.
  3. SMOTE: generates synthetic minority-class samples.
  4. Combined under- and oversampling: an interesting technique that applies oversampling to the minority class to improve its bias, while also applying undersampling to the majority class to reduce the bias toward that class.
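The intuition behind SMOTE (item 3) can be shown in a few lines: a synthetic minority sample is drawn on the segment between a minority observation and one of its k nearest minority neighbors. A minimal sketch with toy data — the analysis below uses `DMwR::SMOTE`; the `minority` matrix here is made up:

```r
set.seed(1)
# Two-feature minority-class observations (toy data)
minority <- matrix(c(1,   1,
                     2,   1,
                     1.5, 2), ncol = 2, byrow = TRUE)

x   <- minority[1, ]   # sample to oversample
nn  <- minority[2, ]   # one of its nearest minority neighbors
gap <- runif(1)        # random position on the segment between them
synthetic <- x + gap * (nn - x)

synthetic  # a point on the segment between (1, 1) and (2, 1)
```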
In [27]:
# Oversampling

df_pca_best$class <- ifelse(df_pca_best$class == "neg", 0, 1)

df_over <- ROSE::ovun.sample(class ~ .,
                            data = df_pca_best,
                            p = 0.7, seed = 1,
                            method = "over"
                            )$data

table(df_over$class)
logit_over <- stats::glm(class ~ ., data = df_over, family = "binomial")
logit_over_pred <- stats::predict(logit_over, df_over, type = "response")
over_pred <- as.data.frame(ifelse(logit_over_pred > 0.5, 1, 0))
names(over_pred) <- c("class")
confusionMatrix(factor(over_pred$class), factor(df_over$class))
     0      1 
 59000 137703 
Warning message:
"glm.fit: algorithm did not converge"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Confusion Matrix and Statistics

          Reference
Prediction      0      1
         0  48908   5952
         1  10092 131751
                                               
               Accuracy : 0.9184               
                 95% CI : (0.9172, 0.9196)     
    No Information Rate : 0.7001               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8018               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.8289               
            Specificity : 0.9568               
         Pos Pred Value : 0.8915               
         Neg Pred Value : 0.9289               
             Prevalence : 0.2999               
         Detection Rate : 0.2486               
   Detection Prevalence : 0.2789               
      Balanced Accuracy : 0.8929               
                                               
       'Positive' Class : 0                    
                                               

The oversampling method achieved an accuracy of 0.9184, i.e., 91.84%.

In [28]:
# Undersampling

data.balanced.under <- ROSE::ovun.sample(class ~ .,
                                         data = df_pca_best,
                                         p = 0.5, seed = 1,
                                         method = "under"
                                         )$data
table(data.balanced.under$class)
logit.under <- stats::glm(class ~ ., data = data.balanced.under, family = "binomial")
logit.under.pred <- stats::predict(logit.under, data.balanced.under, type = "response")
under.pred <- as.data.frame(ifelse(logit.under.pred > 0.5, 1, 0))
names(under.pred) <- c("class")
confusionMatrix(factor(under.pred$class), factor(data.balanced.under$class))
   0    1 
 961 1000 
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 924  91
         1  37 909
                                               
               Accuracy : 0.9347               
                 95% CI : (0.9229, 0.9453)     
    No Information Rate : 0.5099               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8695               
                                               
 Mcnemar's Test P-Value : 0.000002805          
                                               
            Sensitivity : 0.9615               
            Specificity : 0.9090               
         Pos Pred Value : 0.9103               
         Neg Pred Value : 0.9609               
             Prevalence : 0.4901               
         Detection Rate : 0.4712               
   Detection Prevalence : 0.5176               
      Balanced Accuracy : 0.9352               
                                               
       'Positive' Class : 0                    
                                               

The undersampling method achieved an accuracy of 0.9347, i.e., 93.47%.

In [30]:
# SMOTE

# Converting the target variable to a factor
df_class_factor <- df_pca_best
df_class_factor$class <- as.factor(df_class_factor$class)

data.smote <- DMwR::SMOTE(class ~ ., df_class_factor, perc.over = 1900, perc.under = 210.53, k = 5)

table(data.smote$class)
logit.smote <- stats::glm(class ~ ., data = data.smote, family = "binomial")
logit.smote.pred <- stats::predict(logit.smote, data.smote, type = "response")
smote.pred <- as.data.frame(ifelse(logit.smote.pred > 0.5, 1, 0))
names(smote.pred) <- c("class")
confusionMatrix(factor(smote.pred$class), factor(data.smote$class))
    0     1 
40000 20000 
Warning message:
"glm.fit: algorithm did not converge"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39114  2924
         1   886 17076
                                               
               Accuracy : 0.9365               
                 95% CI : (0.9345, 0.9384)     
    No Information Rate : 0.6667               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8534               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.9778               
            Specificity : 0.8538               
         Pos Pred Value : 0.9304               
         Neg Pred Value : 0.9507               
             Prevalence : 0.6667               
         Detection Rate : 0.6519               
   Detection Prevalence : 0.7006               
      Balanced Accuracy : 0.9158               
                                               
       'Positive' Class : 0                    
                                               

The SMOTE method achieved an accuracy of 0.9365, i.e., 93.65%.

In [31]:
# Combined under- and oversampling

data.both <- ROSE::ovun.sample(class ~ ., data = df_pca_best, p = 0.5, seed = 1,
                               method = "both")$data
table(data.both$class)
logit.both <- stats::glm(class ~ ., data = data.both, family = "binomial")
logit.both.pred <- stats::predict(logit.both, data.both, type = "response")
both.pred <- as.data.frame(ifelse(logit.both.pred > 0.5, 1, 0))
names(both.pred) <- c("class")
confusionMatrix(factor(both.pred$class), factor(data.both$class))
    0     1 
29951 30049 
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 28950  3183
         1  1001 26866
                                               
               Accuracy : 0.9303               
                 95% CI : (0.9282, 0.9323)     
    No Information Rate : 0.5008               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8605               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.9666               
            Specificity : 0.8941               
         Pos Pred Value : 0.9009               
         Neg Pred Value : 0.9641               
             Prevalence : 0.4992               
         Detection Rate : 0.4825               
   Detection Prevalence : 0.5355               
      Balanced Accuracy : 0.9303               
                                               
       'Positive' Class : 0                    
                                               

The combination of undersampling and oversampling achieved an accuracy of 0.9303, i.e., 93.03%.

Comparing the confusion matrices of the 4 methods tested, SMOTE produced the best result.

In [32]:
pca.best.smote <- DMwR::SMOTE(class~., df_class_factor, perc.over = 1900, perc.under = 210.53, k=5)
table(pca.best.smote$class)
    0     1 
40000 20000 

Training

Since this is a classification problem, the following models will be tested to determine which one performs best:

  1. SVM
  2. Random Forest
  3. XGBoost

The SVM kernel parameter accepts several values; the linear, radial, and polynomial kernels will be tested here. Different kernels fit different models and, consequently, yield different predictions, so one model is fitted per kernel with the goal of minimizing the misclassification rate.

In [42]:
# Linear SVM

df_pca_test_class <- df_pca_test
df_pca_test_class$class <- ifelse(df_pca_test_class$class == "neg", 0, 1)
df_pca_test_class$class <- as.factor(df_pca_test_class$class)

svm.lin <- e1071::svm(class~., data=pca.best.smote, kernel='linear', cost=0.01)
summary(svm.lin)
pred_lin <- stats::predict(svm.lin, df_pca_test)
cm <- confusionMatrix(pred_lin, df_pca_test_class$class)
Call:
svm(formula = class ~ ., data = pca.best.smote, kernel = "linear", 
    cost = 0.01)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.01 

Number of Support Vectors:  11661

 ( 5829 5832 )


Number of Classes:  2 

Levels: 
 0 1


The test-set accuracy for the SVM with linear kernel and cost = 0.01 is 0.9738, or 97.38%. The cost of a model is defined as $10 per false positive plus $500 per false negative, i.e., cost = (FP × 10) + (FN × 500).

In [34]:
custo_total <- function(FP, FN){
  # Total cost: $10 per false positive, $500 per false negative
  custo <- (FP * 10) + (FN * 500)
  return(custo)
}
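As a quick sanity check, plugging in the false-positive and false-negative counts from the XGBoost confusion matrix reported further below (654 and 22, respectively) reproduces its total cost:

```r
# 654 false positives at $10 each, 22 false negatives at $500 each
custo_total <- function(FP, FN){
  custo <- (FP * 10) + (FN * 500)
  return(custo)
}
custo_total(654, 22)  # (654 * 10) + (22 * 500) = 17540
```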
In [53]:
# Cost using the linear SVM
custo <- custo_total(cm$table[2,1], cm$table[1,2])
paste0("Linear SVM cost: $", custo)
'Linear SVM cost: $30160'
In [46]:
# SVM with radial kernel

svm_rad <- e1071::svm(class~., data=pca.best.smote, kernel='radial', gamma=0.1, cost=0.1)
summary(svm_rad)
pred_rad <- stats::predict(svm_rad, df_pca_test)
cm_svm_rad <- confusionMatrix(pred_rad, df_pca_test_class$class)
cm_svm_rad
Call:
svm(formula = class ~ ., data = pca.best.smote, kernel = "radial", 
    gamma = 0.1, cost = 0.1)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  0.1 

Number of Support Vectors:  10484

 ( 4943 5541 )


Number of Classes:  2 

Levels: 
 0 1


The test-set accuracy for the SVM with radial kernel and cost = 0.1 is 0.9645, or 96.45%. The cost obtained by this model is:

In [52]:
# Cost using the radial SVM
custo <- custo_total(cm_svm_rad$table[2,1], cm_svm_rad$table[1,2])
paste0("Radial SVM cost: $", custo)
'Radial SVM cost: $22340'
In [54]:
# SVM with polynomial kernel

svm_poly <- e1071::svm(class ~ ., data = pca.best.smote, kernel = 'polynomial', cost = 0.01, gamma = 0.1, degree = 2)
summary(svm_poly)
pred_poly <- stats::predict(svm_poly, df_pca_test)
cm_svm_poly <- confusionMatrix(pred_poly, df_pca_test_class$class)
cm_svm_poly
Call:
svm(formula = class ~ ., data = pca.best.smote, kernel = "polynomial", 
    cost = 0.01, gamma = 0.1, degree = 2)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  polynomial 
       cost:  0.01 
     degree:  2 
     coef.0:  0 

Number of Support Vectors:  20408

 ( 10205 10203 )


Number of Classes:  2 

Levels: 
 0 1


In [59]:
# Cost using the polynomial SVM
custo <- custo_total(cm_svm_poly$table[2,1], cm_svm_poly$table[1,2])
paste0("Polynomial SVM cost: $", custo)
'Polynomial SVM cost: $65500'

The test-set accuracy for the SVM with polynomial kernel and cost = 0.01 is 0.9765, or 97.65%; its cost, shown above, is the highest of the three kernels.

In [58]:
# Random Forest

set.seed(1111)
rf <- randomForest::randomForest(class~., data = pca.best.smote, ntree = 500, mtry = 6, importance= TRUE)
pred_rf <- stats::predict(rf, df_pca_test, type = "class")
cm_rf <- confusionMatrix(pred_rf, df_pca_test_class$class)
cm_rf
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 15383    50
         1   242   325
                                               
               Accuracy : 0.9818               
                 95% CI : (0.9796, 0.9838)     
    No Information Rate : 0.9766               
    P-Value [Acc > NIR] : 0.000003857          
                                               
                  Kappa : 0.681                
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.9845               
            Specificity : 0.8667               
         Pos Pred Value : 0.9968               
         Neg Pred Value : 0.5732               
             Prevalence : 0.9766               
         Detection Rate : 0.9614               
   Detection Prevalence : 0.9646               
      Balanced Accuracy : 0.9256               
                                               
       'Positive' Class : 0                    
                                               

The test-set accuracy for the Random Forest model is 0.9818, or 98.18%. The cost obtained by this model is:

In [60]:
# Cost using the Random Forest algorithm
custo <- custo_total(cm_rf$table[2,1], cm_rf$table[1,2])
paste0("Random Forest cost: $", custo)
'Random Forest cost: $27420'
In [61]:
# XGBoost 

train_label <- pca.best.smote[, 'class']
train_label <- as.integer(train_label) - 1
train_matrix <- xgboost::xgb.DMatrix(data = as.matrix(pca.best.smote[-27]), label = train_label)

test_label <- df_pca_test[, 'class']
test_label <- as.integer(test_label) - 1
test_matrix <- xgboost::xgb.DMatrix(data = as.matrix(df_pca_test[-27]), label = test_label)

set.seed(1111)
xgb <- xgboost(data = train_matrix,
               eta = 0.4,
               max_depth = 6, 
               nround=100, 
               subsample = 0.5,
               min_child_weight = 2,
               colsample_bytree = 0.5,
               seed = 1111, 
               gamma = 100,
               eval_metric = "error",
               objective = "binary:logistic",
               nthread = 3)

y_pred <- round(predict(xgb, data.matrix(df_pca_test[,-27])))

y_pred[y_pred == 0] <- c("neg")
y_pred[y_pred == 1] <- c("pos")
cm_xgb <- confusionMatrix(factor(y_pred), factor(df_pca_test$class))
cm_xgb
Warning message in xgb.train(params, dtrain, nrounds, watchlist, verbose = verbose, :
"xgb.train: `seed` is ignored in R package.  Use `set.seed()` instead."
[1]	train-error:0.069767 
[2]	train-error:0.067883 
[3]	train-error:0.064900 
[4]	train-error:0.064983 
[5]	train-error:0.061650 
[6]	train-error:0.061317 
[7]	train-error:0.059817 
[8]	train-error:0.058167 
[9]	train-error:0.057867 
[10]	train-error:0.057667 
[11]	train-error:0.057917 
[12]	train-error:0.058433 
[13]	train-error:0.057200 
[14]	train-error:0.057200 
[15]	train-error:0.057183 
[16]	train-error:0.057333 
[17]	train-error:0.057183 
[18]	train-error:0.057217 
[19]	train-error:0.057267 
[20]	train-error:0.057250 
[21]	train-error:0.057200 
[22]	train-error:0.057233 
[23]	train-error:0.057267 
[24]	train-error:0.056433 
[25]	train-error:0.056283 
[26]	train-error:0.056517 
[27]	train-error:0.056433 
[28]	train-error:0.056517 
[29]	train-error:0.056483 
[30]	train-error:0.056833 
[31]	train-error:0.056433 
[32]	train-error:0.056433 
[33]	train-error:0.056517 
[34]	train-error:0.056300 
[35]	train-error:0.056517 
[36]	train-error:0.056200 
[37]	train-error:0.056267 
[38]	train-error:0.056183 
[39]	train-error:0.056200 
[40]	train-error:0.056150 
[41]	train-error:0.056150 
[42]	train-error:0.056183 
[43]	train-error:0.056217 
[44]	train-error:0.056200 
[45]	train-error:0.056267 
[46]	train-error:0.056200 
[47]	train-error:0.056200 
[48]	train-error:0.056150 
[49]	train-error:0.056217 
[50]	train-error:0.056217 
[51]	train-error:0.056183 
[52]	train-error:0.056183 
[53]	train-error:0.056183 
[54]	train-error:0.056200 
[55]	train-error:0.056150 
[56]	train-error:0.056183 
[57]	train-error:0.056217 
[58]	train-error:0.056250 
[59]	train-error:0.056183 
[60]	train-error:0.055433 
[61]	train-error:0.055383 
[62]	train-error:0.055533 
[63]	train-error:0.055567 
[64]	train-error:0.055533 
[65]	train-error:0.055517 
[66]	train-error:0.055467 
[67]	train-error:0.055483 
[68]	train-error:0.055467 
[69]	train-error:0.055450 
[70]	train-error:0.055500 
[71]	train-error:0.055467 
[72]	train-error:0.055517 
[73]	train-error:0.055483 
[74]	train-error:0.054917 
[75]	train-error:0.054833 
[76]	train-error:0.054850 
[77]	train-error:0.054783 
[78]	train-error:0.054850 
[79]	train-error:0.054850 
[80]	train-error:0.054900 
[81]	train-error:0.054900 
[82]	train-error:0.054850 
[83]	train-error:0.054850 
[84]	train-error:0.054717 
[85]	train-error:0.054683 
[86]	train-error:0.054717 
[87]	train-error:0.054717 
[88]	train-error:0.054733 
[89]	train-error:0.054717 
[90]	train-error:0.054733 
[91]	train-error:0.054833 
[92]	train-error:0.054850 
[93]	train-error:0.054767 
[94]	train-error:0.054717 
[95]	train-error:0.054700 
[96]	train-error:0.054717 
[97]	train-error:0.054717 
[98]	train-error:0.054850 
[99]	train-error:0.054917 
[100]	train-error:0.054917 
Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 14971    22
       pos   654   353
                                             
               Accuracy : 0.9578             
                 95% CI : (0.9545, 0.9608)   
    No Information Rate : 0.9766             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.4936             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.9581             
            Specificity : 0.9413             
         Pos Pred Value : 0.9985             
         Neg Pred Value : 0.3505             
             Prevalence : 0.9766             
         Detection Rate : 0.9357             
   Detection Prevalence : 0.9371             
      Balanced Accuracy : 0.9497             
                                             
       'Positive' Class : neg                
                                             

The test-set accuracy for the XGBoost model is 0.9578, or 95.78%. The cost obtained by this model is:

In [63]:
# Cost using the XGBoost algorithm
custo <- custo_total(cm_xgb$table[2,1], cm_xgb$table[1,2])
paste0("XGBoost cost: $", custo)
'XGBoost cost: $17540'

Conclusion

Although XGBoost was not the model with the best accuracy, it achieved the lowest total cost ($17540). In other words, it substantially reduced the false-negative errors, which dominate the cost function, and therefore obtained the lowest cost among all the algorithms tested.
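For reference, the per-model costs reported above can be gathered into a single table for a side-by-side comparison (values copied from this notebook's outputs):

```r
# Total cost per model, taken from the confusion matrices above
costs <- data.frame(
  model = c("SVM linear", "SVM radial", "SVM polynomial",
            "Random Forest", "XGBoost"),
  cost  = c(30160, 22340, 65500, 27420, 17540)
)
costs[order(costs$cost), ]  # XGBoost comes first, with the lowest cost
```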